(.packages())
## [1] "corrplot" "corrr" "openxlsx" "plotly" "ggplot2"
## [6] "formattable" "tidyr" "dplyr" "stats" "graphics"
## [11] "grDevices" "utils" "datasets" "methods" "base"
The provided data is organized in a long format: each patient has several rows, and each row describes a single moment in time at which a certain parameter was measured. Since a row holds values only for the parameters measured at that moment, the data contains many NA values, both row-wise and column-wise.
| Rows in the dataset | Columns in the dataset | Decisive attributes | First admission | Last discharge |
|---|---|---|---|---|
| 6120 | 84 | 78 | 2020-01-10 15:52:20 | 2020-03-04 16:21:51 |

| Gender | Number of cases |
|---|---|
| Male | 224 |
| Female | 151 |
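The sparsity that the long format introduces can be measured directly. The following is a minimal sketch on a made-up data frame, not on the study data: the columns `param_a` and `param_b` and all values are hypothetical, and only the names `PATIENT_ID` and `outcome` mirror this report.

```r
# Toy long-format data: one row per measurement moment (hypothetical values)
toy <- data.frame(
  PATIENT_ID = c(1, 1, 1, 2, 2),
  param_a    = c(5.1, NA, NA, 4.8, NA),
  param_b    = c(NA, 120, NA, NA, 110),
  outcome    = c(0, 0, 0, 1, 1)
)

# Share of missing values per column (column-wise sparsity)
na_share_by_column <- colMeans(is.na(toy))

# Share of missing values per row (row-wise sparsity)
na_share_by_row <- rowMeans(is.na(toy))

# Overall share of missing cells in the data frame
overall_na_share <- mean(is.na(toy))

print(na_share_by_column)
print(overall_na_share)
```

Applied to the real `data` object, the same three calls would show which attributes are measured rarely and are therefore most affected by the choice of aggregation method.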
To create a correlation matrix, all measurements of every patient have to be aggregated into a single row. Hence an aggregation method must be chosen for columns containing more than one value. The following block creates three data frames, each using a different aggregation method: mean, max and last. The “last” method means that only the most recent available value is taken into consideration. Each of these data frames is then used to create a correlation data frame with the corrr package, which returns the correlations directly as a data frame and so skips the step of building a correlation matrix and converting it. In the following blocks and explanations I will refer to the three results as the “mean”, “max” and “last” correlations.
# Mean of every numeric attribute per patient, ignoring NA values
numeric_data_mean <- data %>%
  group_by(PATIENT_ID) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) %>%
  select(-PATIENT_ID)
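# Maximum of every numeric attribute per patient, ignoring NA values
# Note: max() returns -Inf (with a warning) when a patient has no recorded value
numeric_data_max <- data %>%
  group_by(PATIENT_ID) %>%
  summarise(across(where(is.numeric), ~ max(.x, na.rm = TRUE))) %>%
  select(-PATIENT_ID)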
# Most recent available value per patient: carry values forward, keep the last row
numeric_data_last <- data %>%
  group_by(PATIENT_ID) %>%
  fill(everything()) %>%
  filter(row_number() == n()) %>%
  ungroup() %>%
  select(where(is.numeric)) %>%
  select(-PATIENT_ID)
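To make the difference between the three aggregation methods concrete, here is a small sketch on invented data. The column `marker` and all values are hypothetical; the pipeline steps follow the ones used above, except that `fill()` is applied to the single toy column and `slice_tail()` is used to take the last row.

```r
library(dplyr)
library(tidyr)

# Two hypothetical patients with several measurement rows each
toy <- tibble(
  PATIENT_ID = c(1, 1, 1, 2, 2),
  marker     = c(30, NA, 10, 7, NA),
  outcome    = c(0, 0, 0, 1, 1)
)

# "mean": average of the recorded values per patient
toy_mean <- toy %>%
  group_by(PATIENT_ID) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# "max": largest recorded value per patient
toy_max <- toy %>%
  group_by(PATIENT_ID) %>%
  summarise(across(where(is.numeric), ~ max(.x, na.rm = TRUE)))

# "last": carry values forward within a patient, then keep the final row
toy_last <- toy %>%
  group_by(PATIENT_ID) %>%
  fill(marker, .direction = "down") %>%
  slice_tail(n = 1) %>%
  ungroup()

print(toy_mean)
print(toy_max)
print(toy_last)
```

For patient 1 the three methods give 20, 30 and 10 respectively. For patient 2 the final row has no measurement, so the “last” method falls back to the carried-forward value 7.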
The corrr library can “focus” on a selected attribute, which means that it filters out all correlations not connected to that attribute. In this study we want to determine which attributes are associated with which outcome of the disease, so the focused attribute is “outcome”. The results are shown below in the form of bar plots. To keep the plots readable, only correlations higher than 0.6 or lower than -0.6 are shown. The plots can be clicked to show the values of the correlations.
# Mean correlation
p <- correlate(numeric_data_mean, quiet = TRUE) %>%
  focus(outcome) %>%
  mutate(term = reorder(term, outcome)) %>%
  filter(outcome > 0.6 | outcome < -0.6) %>%
  ggplot(aes(term, outcome)) +
  geom_col() + coord_flip() +
  labs(title = "Mean correlation")
ggplotly(p)

# Max correlation
p <- correlate(numeric_data_max, quiet = TRUE) %>%
  focus(outcome) %>%
  mutate(term = reorder(term, outcome)) %>%
  filter(outcome > 0.6 | outcome < -0.6) %>%
  ggplot(aes(term, outcome)) +
  geom_col() + coord_flip() +
  labs(title = "Max correlation")
ggplotly(p)

# Last correlation
p <- correlate(numeric_data_last, quiet = TRUE) %>%
  focus(outcome) %>%
  mutate(term = reorder(term, outcome)) %>%
  filter(outcome > 0.6 | outcome < -0.6) %>%
  ggplot(aes(term, outcome)) +
  geom_col() + coord_flip() +
  labs(title = "Last correlation")
ggplotly(p)
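For readers who want to see what the focus-and-filter step amounts to, the sketch below reproduces it in base R on invented data. It assumes the corrr defaults (Pearson correlation on pairwise complete observations); the columns `marker_a`, `marker_b` and `marker_c` and their values are hypothetical.

```r
# Toy one-row-per-patient data (hypothetical values)
toy <- data.frame(
  marker_a = c(1, 2, 3, 4, 5),
  marker_b = c(5, 4, 3, 2, 1),
  marker_c = c(2, 1, 4, 3, 5),
  outcome  = c(0, 0, 0, 1, 1)
)

# Full correlation matrix, using the same NA handling corrr applies by default
cor_matrix <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

# "Focus" on outcome: keep only the outcome column and drop its self-correlation
outcome_cor <- cor_matrix[rownames(cor_matrix) != "outcome", "outcome"]

# Keep only the strong correlations, as in the plots
strong <- outcome_cor[abs(outcome_cor) > 0.6]

print(round(outcome_cor, 3))
print(names(strong))
```

Here `marker_a` and `marker_b` pass the threshold (about 0.87 and -0.87), while `marker_c` (about 0.58) is filtered out, which is the same selection the `filter()` call performs before plotting.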